Normalized Google distance

The normalized Google distance (NGD) is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords.^[1] Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of normalized Google distance, while words with dissimilar meanings tend to be farther apart.

Specifically, the NGD between two search terms x and y is

$$\operatorname{NGD}(x,y)={\frac {\max\{\log f(x),\log f(y)\}-\log f(x,y)}{\log N-\min\{\log f(x),\log f(y)\}}}$$

where N is the total number of web pages searched by Google multiplied by the average number of singleton search terms occurring on pages; f(x) and f(y) are the number of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.

If $\operatorname{NGD}(x,y)=0$, then x and y are viewed as alike as possible; if $\operatorname{NGD}(x,y)\geq 1$, they are very different. If x and y never occur together on the same web page but do occur separately, then $f(x,y)=0$, so $\log f(x,y)$ diverges to $-\infty$ and the NGD between them is infinite. If both terms always occur together, then $f(x)=f(y)=f(x,y)$, the numerator vanishes, and their NGD is zero.
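The definition and its two limiting cases can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ngd` and its parameter names are assumptions made for the example, and the hit counts are taken as already known rather than fetched from any search engine. Natural logarithms are used because the base cancels in the ratio.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance computed from raw hit counts.

    fx, fy -- hits for each term on its own (must be positive)
    fxy    -- hits for pages containing both terms
    n      -- the normalizing constant N from the formula
    All names are illustrative and do not come from the cited paper.
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if n <= min(fx, fy):
        raise ValueError("N must exceed the smaller hit count")
    if fxy <= 0:
        # log f(x, y) diverges to -infinity, so the distance is infinite.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Terms that always occur together: distance 0.
always_together = ngd(1000, 1000, 1000, 1e6)
# Terms that never occur together: infinite distance.
never_together = ngd(1000, 500, 0, 1e6)
```

Returning `math.inf` for the non-co-occurring case mirrors the limiting behaviour of the formula; a caller that prefers a bounded score would need to clamp the result separately.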

Example: On 9 April 2013, googling for "Shakespeare" gave 130,000,000 hits; googling for "Macbeth" gave 26,000,000 hits; and googling for "Shakespeare Macbeth" gave 20,800,000 hits. The number of pages indexed by Google was estimated by the number of hits for the search term "the", which was 25,270,000,000. Assuming there are about 1,000 search terms on the average page, this gives $N=25,270,000,000,000$. Hence, taking logarithms to base 2,

$$\operatorname{NGD}(\text{Shakespeare},\text{Macbeth})={\frac {26.95-24.31}{44.52-24.63}}\approx 0.13.$$

"Shakespeare" and "Macbeth" are very much alike according to the relative semantics supplied by Google.
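The arithmetic in this example can be checked with a short script. The sketch below simply plugs the quoted hit counts into the formula; the variable names are illustrative, and base-2 logarithms are assumed because they reproduce the intermediate values shown in the example.

```python
import math

# Hit counts quoted in the example (9 April 2013).
f_x = 130_000_000             # "Shakespeare"
f_y = 26_000_000              # "Macbeth"
f_xy = 20_800_000             # "Shakespeare Macbeth"
n = 25_270_000_000 * 1_000    # hits for "the" times about 1,000 terms per page

# Base-2 logarithms of each quantity in the formula.
log_fx, log_fy, log_fxy, log_n = (math.log2(v) for v in (f_x, f_y, f_xy, n))

distance = (max(log_fx, log_fy) - log_fxy) / (log_n - min(log_fx, log_fy))

print(round(log_fx, 2), round(log_fy, 2), round(log_fxy, 2), round(log_n, 2))
print(round(distance, 2))
```

Rounded to two decimals, the four logarithms and the resulting distance agree with the figures in the example. Switching to `math.log` or `math.log10` changes the intermediate values but not the final ratio.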

[1] R.L. Cilibrasi; P.M.B. Vitanyi (2007). "The Google similarity distance". IEEE Trans. Knowledge and Data Engineering. 19 (3): 370–383. arXiv:cs/0412098. doi:10.1109/TKDE.2007.48. S2CID 59777.
